[doc] Adjusted yuanrong backend doc by dpj135 · Pull Request #104 · Ascend/TransferQueue

dpj135 · 2026-05-18T12:34:13Z

Description

I've updated the description in the Yuanrong backend documentation, adding more usage guidance.

Main changes

- Add more detailed descriptions of installation and usage.
- Adjust the demos to use `transfer_queue.init()` to start TransferQueue and Yuanrong.
- Add instructions for manually launching Yuanrong when `auto_init=False`.
- Add an FAQ recording common issues encountered when using Yuanrong.

ascend-robot · 2026-05-18T12:34:24Z

CLA Signature Pass

dpj135, thanks for your pull request. All authors of the commits have signed the CLA. 👍

tianyi-ge · 2026-05-20T08:26:56Z

+## Quick Start

 ### Prerequisites
 - **Python Version**: $ \geq 3.10~and \leq 3.11 $


This line is not rendered correctly by Markdown. Just use `>= 3.10, <= 3.11`.
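
For illustration, the same constraint can be written as a runtime check. This is a minimal sketch, assuming the supported range is exactly the one quoted in the diff (3.10 through 3.11 inclusive); the helper name `is_supported_python` is hypothetical and is not part of TransferQueue.

```python
import sys


def is_supported_python(version_info=None):
    """Return True when (major, minor) lies within 3.10..3.11 inclusive.

    `version_info` defaults to the running interpreter; pass a tuple such as
    (3, 11, 4) to check another version. The bounds are an assumption taken
    from the prerequisite line under review.
    """
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return (3, 10) <= major_minor <= (3, 11)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported_python() else "outside the documented range"
    print(f"Python {current} is {status}")
```

Comparing `(major, minor)` tuples avoids the string-comparison pitfall where `"3.9" > "3.10"`.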

tianyi-ge · 2026-05-20T08:31:39Z

  - `--shared_memory_size_mb`: Shared memory size in MB for datasystem worker.
-  - `--remote_h2d_device_ids`: Enable RH2D (Remote Host-to-Device) for efficient cross-node data transfer. Specify NPU device IDs as comma-separated values (e.g., `0,1,2,3`).
-  - `--enable_huge_tlb`: Enable huge page memory, required for >21GB shared memory on Ascend 910B.
+  - `--enable_huge_tlb`: Configure huge page memory to reduce TLB misses and improve memory access efficiency. Note: may cause system memory shortage, kernel OOM, or system instability. Required for >21GB shared memory on Ascend 910B.


I think it's better to remind users to allocate huge pages before starting datasystem. You could link to the datasystem huge page guide: https://pages.openeuler.openatom.cn/openyuanrong-datasystem/docs/zh-cn/latest/appendix/hugepage_guide.html

tianyi-ge · 2026-05-20T08:33:51Z


-Next, we will provide deployment and code examples for single-node scenarios.
-For multi-node scenarios, please refer to [Appendix B](#B-deploy-multi-node-datasystem-for-multi-node-training-and-inference-scenarios).
+When `auto_init: True` is set in the configuration, TransferQueue automatically initializes the Yuanrong backend during `tq.init()`. The deployment process:


It's better to tell readers that Yuanrong is deployed per host: one deployment manages all clients on the same node. Otherwise some users may mistakenly think the Yuanrong backend is per-client.

tianyi-ge · 2026-05-20T08:36:42Z

+**NPU Transfer Options:**
+- `enable_yr_npu_transport`: Enable NPU transport for high-performance device-to-device data transfer. Set to `true` when using NPU tensors.
+- `worker_args` (recommended when `enable_yr_npu_transport: true`):
+  - `--remote_h2d_device_ids`: Enable RH2D (Remote Host-to-Device) for efficient cross-node NPU data transfer. Specify NPU device IDs as comma-separated values (e.g., `0,1,2,3`).


Yuanrong manages all the devices specified here. If you want to set/get tensors on NPU x, you need to include device ID x in this argument.
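
A small sketch of that rule, assuming only what the diff states (the value is a comma-separated list of NPU device IDs such as `0,1,2,3`). Both helper names are hypothetical; how the value is finally passed inside `worker_args` is not shown here because the PR excerpt does not specify it.

```python
def format_device_ids(device_ids):
    """Build the comma-separated value from an iterable of NPU device IDs.

    Duplicates are dropped and IDs are sorted so the value is stable.
    """
    unique = sorted({int(d) for d in device_ids})
    if any(d < 0 for d in unique):
        raise ValueError("device IDs must be non-negative integers")
    return ",".join(str(d) for d in unique)


def missing_devices(configured_value, used_device_ids):
    """Return the device IDs a program uses that the configured value omits."""
    configured = {int(part) for part in configured_value.split(",") if part.strip()}
    return sorted({int(d) for d in used_device_ids} - configured)


if __name__ == "__main__":
    value = format_device_ids([3, 0, 1, 2, 1])
    print(value)
    # NPU 4 is used by the program but not listed, so it is reported.
    print(missing_devices(value, [1, 4]))
```

Running a check like `missing_devices` before initialization surfaces an omitted device ID early, rather than at the first set/get on that NPU.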

tianyi-ge · 2026-05-20T08:38:02Z

+
+```bash
+# On head node
+ray start --head --resources='{"node:192.168.0.1": 1}'


I remember haichuan saying that the node-IP resources are not necessary. If that's true, this start command can be simplified.

This is for controlling the placement of Ray actors.

tianyi-ge · 2026-05-20T08:39:25Z

+TransferQueue will detect all Ray nodes and deploy datasystem workers automatically.

-Once the configuration is set, you can run your TransferQueue + Datasystem application directly.
+#### Multi-Node Demo


Add a short line reminding users which lines must be modified (the node IPs) before giving them a big chunk of code.

tianyi-ge · 2026-05-20T08:40:34Z

+If `worker_port` or `metastore_port` is already in use, initialization will fail:
+
+```
+RuntimeError: Failed to start datasystem worker...


Is a port conflict the only possible reason for failing to start the datasystem worker?

Add more situations.
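
For the port-conflict case specifically, a pre-flight check can separate it from other startup failures. The sketch below uses only the Python standard library; `port_is_free` is a hypothetical helper, not a TransferQueue or datasystem API, and it only tests whether a TCP port can be bound on the given local address.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port) right now.

    A False result means another process already holds the port. A True
    result does not guarantee the worker will start, since other causes
    (memory, permissions, and so on) are not checked here.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


if __name__ == "__main__":
    # 31501 is the worker port that appears in the cleanup command in this PR.
    print(port_is_free(31501))
```

If the port is free and the worker still fails to start, the cause lies elsewhere, which is the gap this comment asks the FAQ to cover.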

tianyi-ge · 2026-05-20T08:41:28Z

+# Clean up
+dscli stop --worker_address <IP>:31501
+# Or force cleanup
+pkill -f dscli


Should this kill `dscli` or `datasystem_worker`?

tianyi-ge · 2026-05-20T08:42:49Z

+pkill -f dscli
+```
+
+### Multi-Process Initialization


Why is this an FAQ?

Users may be confused about how to initialize the Yuanrong worker with multiple processes. This section explains the process of `tq.init()`.

KaisennHu

Overall looks good. Some minor comments.

KaisennHu · 2026-05-20T08:36:39Z

+
+**NPU Transfer Options:**
+- `enable_yr_npu_transport`: Enable NPU transport for high-performance device-to-device data transfer. Set to `true` when using NPU tensors.
+- `worker_args` (recommended when `enable_yr_npu_transport: true`):


When `enable_yr_npu_transport` is set to `true`, `remote_h2d_device_ids` is mandatory, not merely recommended.

KaisennHu · 2026-05-20T08:47:34Z

+1. **Detects Ray cluster nodes** - identifies all alive nodes in the Ray cluster
+2. **Creates placement group** - uses `STRICT_SPREAD` strategy to ensure workers are distributed across nodes
+3. **Launches YuanrongWorkerActor** - creates one actor per node to manage the datasystem worker
+4. **Sets up metastore service** - the head node (driver node) starts the metastore service, other nodes connect as workers


The ‘-’ symbols look a bit strange.

I think it doesn't look bad. (^w^)
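
The role split in step 4 of the quoted list (head node runs the metastore, other nodes connect as workers, one actor per node) can be pictured with a pure-Python sketch. This is an illustration only: `assign_roles` and the role labels are hypothetical, and the real implementation uses Ray placement groups and actors that are not reproduced here.

```python
def assign_roles(node_ips, head_ip):
    """Map each node IP to a role: the head node hosts the metastore,
    every other node connects to it as a worker.

    One entry per distinct node mirrors the one-actor-per-node layout.
    """
    unique_nodes = list(dict.fromkeys(node_ips))  # keep order, drop duplicates
    if head_ip not in unique_nodes:
        raise ValueError(f"head node {head_ip} is not among the detected nodes")
    return {
        ip: ("metastore" if ip == head_ip else "worker")
        for ip in unique_nodes
    }


if __name__ == "__main__":
    roles = assign_roles(["192.168.0.1", "192.168.0.2"], head_ip="192.168.0.1")
    for ip, role in roles.items():
        print(ip, role)
```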

KaisennHu · 2026-05-20T08:51:06Z

+# On head node
+ray start --head --resources='{"node:192.168.0.1": 1}'
+
+# On worker node (assume ray port of head_node is 6379)
+ray start --address="192.168.0.1:6379" --resources='{"node:192.168.0.2": 1}'


To start Ray in an NPU environment, users need to be reminded to add --resources='{"NPU": 4}' or configure ASCEND_RT_VISIBLE_DEVICES.

Copilot

Pull request overview

Updates the Yuanrong storage-backend documentation to provide clearer installation, configuration, deployment, and troubleshooting guidance, and links the guide from the main README.

Changes:

- Added a README link to the Yuanrong usage guide.
- Restructured and expanded the Yuanrong backend guide with demos (single-node + multi-node), config explanations, and manual startup instructions.
- Added an FAQ section covering common deployment/runtime issues.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| README.md | Adds a direct link to the Yuanrong backend usage guide from the supported backends list. |
| docs/storage_backends/openyuanrong_datasystem.md | Expands and reorganizes Yuanrong backend documentation (install, demos, deployment, manual mode, FAQ). |


dpj135 · 2026-05-22T09:26:08Z

+# Install Torch (recommended version: 2.8.0 or higher)
 pip install torch==2.8.0


Add extra annotation

dpj135 · 2026-05-22T09:26:01Z

 # For root users
 ll /usr/local/Ascend/ascend-toolkit/latest

 # For non-root users
 ll ${HOME}/Ascend/ascend-toolkit/latest
 ```


dpj135 · 2026-05-22T09:25:51Z

+
+#### Option 1: Docker Image (Recommended)
+
+First, select the appropriate [CANN image](https://hub.docker.com/r/ascendai/cann) aligned with your **CANN version**, **Ascend hardware**, **OS**, and **Python version**. For examples:


0oshowero0 · 2026-05-22T04:56:38Z

TQ already declares openyuanrong-datasystem as an optional dependency. We can use `pip install TransferQueue[yuanrong]` to install the corresponding openyuanrong-datasystem directly.

Signed-off-by: dpj135 <958208521@qq.com>

ascend-robot added the ascend-cla/yes label May 18, 2026

dpj135 force-pushed the fix_yr_init_and_doc branch from 5e20c9d to a80b636 Compare May 19, 2026 11:01

Adjusted yuanrong backend doc

e353874

Signed-off-by: dpj135 <958208521@qq.com>

dpj135 force-pushed the fix_yr_init_and_doc branch from a80b636 to e353874 Compare May 20, 2026 04:42

Used kv interface(Higher API)

0c31647

Signed-off-by: dpj135 <958208521@qq.com>

dpj135 force-pushed the fix_yr_init_and_doc branch from 41f4a62 to 0c31647 Compare May 20, 2026 07:56

dpj135 marked this pull request as ready for review May 20, 2026 07:56

tianyi-ge reviewed May 20, 2026

KaisennHu reviewed May 20, 2026

0oshowero0 requested a review from Copilot May 22, 2026 04:53

Copilot started reviewing on behalf of 0oshowero0 May 22, 2026 04:54

Copilot AI reviewed May 22, 2026

0oshowero0 reviewed May 22, 2026

Fixed comments

88b8591

Signed-off-by: dpj135 <958208521@qq.com>

dpj135 force-pushed the fix_yr_init_and_doc branch from bcd05d4 to 88b8591 Compare May 22, 2026 09:25
